Week 01: Introduction, tidyverse, R Markdown

2020-Jan-13 (updated: 2019-12-11)

Course Intro

Today’s topics

Data Visualization

Important in science, business, journalism, etc.

Two key roles:

  1. Communication
  2. Analysis

Summary statistics never give the full picture:

ggplot: Grammar of Graphics

Many graphing programs treat data visualization like painting picture: red circle at \(x_1,y_1\), blue square at \(x_2,y_2\), etc.

This is inefficient and very easy to make mistakes.

ggplot: Grammar of Graphics

Many graphing programs treat data visualization like painting picture: red circle at \(x_1,y_1\), blue square at \(x_2,y_2\), etc.

This is inefficient and very easy to make mistakes.

The ggplot way: a graph has 3 core elements

  1. A data set
  2. A set of mappings between variables in the data set and properties (aesthetics) of the graph
  3. Layers of geoms to instantiate those mappings

A simple example

A basic line plot:

library(ggplot2)
ggplot(Orange, aes(x=age, y=circumference, color=Tree)) + 
  geom_line()

A simple example

Add points

ggplot(Orange, aes(x=age, y=circumference, color=Tree)) + 
  geom_line() + geom_point()

Notice that the lines and points have matching colors, but you only need to specify the color mapping once.

All geoms will “inherit” the same aesthetic mappings unless otherwise specified. This helps maintain consistency.

Compute summaries on the fly

ggplot(Orange, aes(age, circumference)) + 
  geom_point() +
  stat_summary(fun.y=mean, geom="line")

Do not need to have different data sets for all data and summarized data.

Can just use one data set and visualize it in different ways:

This makes exploratory visualization much easier.

Exclude cases on the fly

ggplot(subset(Orange, Tree != "5"), aes(age, circumference)) +
  geom_point() +
  stat_summary(fun.y=mean, geom="line")

This also helps exploratory visualization.

Small multiples, aka facets

ggplot(Orange, aes(age, circumference)) + 
  facet_wrap(~ Tree) + 
  geom_line()

Very important for visualizing complex data sets. The individual facets are properly sized and aligned, and inherit aesthetics to make comparisons easy.

Multi-panel plots

For combining different plots, use patchwork

library(patchwork)
p1 <- ggplot(esoph, aes(agegp, ncases/ncontrols, color=alcgp)) +
  stat_smooth(aes(group=alcgp), se=FALSE) +
  scale_color_brewer(palette = "Set1")
p2 <- ggplot(esoph, aes(agegp, ncases/ncontrols, color=tobgp)) + 
  stat_smooth(aes(group=tobgp), se=FALSE) +
  scale_color_brewer(palette = "Set1")
p1 + p2

Showing variability: Errorbars

load("../data/Weather_2014.RData")
summary(weather_day)
##      month             day           TempAve           TempMin      
##  Min.   : 1.000   Min.   : 1.00   Min.   : 0.6423   Min.   :-1.993  
##  1st Qu.: 4.000   1st Qu.: 8.00   1st Qu.: 5.6862   1st Qu.: 3.431  
##  Median : 7.000   Median :16.00   Median : 9.0329   Median : 6.486  
##  Mean   : 6.526   Mean   :15.72   Mean   : 9.4095   Mean   : 6.664  
##  3rd Qu.:10.000   3rd Qu.:23.00   3rd Qu.:13.1557   3rd Qu.:10.230  
##  Max.   :12.000   Max.   :31.00   Max.   :19.2297   Max.   :14.720  
##     TempMax            Rain       
##  Min.   : 2.273   Min.   : 0.000  
##  1st Qu.: 8.340   1st Qu.: 0.000  
##  Median :12.030   Median : 0.400  
##  Mean   :12.438   Mean   : 3.192  
##  3rd Qu.:16.620   3rd Qu.: 2.800  
##  Max.   :24.470   Max.   :46.800
ggplot(weather_day, aes(month, TempAve)) +
  stat_summary(fun.y=mean, geom="line") + 
  stat_summary(fun.data=mean_se, geom="errorbar")

Showing variability: Jittered points

ggplot(weather_day, aes(month, TempAve)) +
  geom_jitter(width=0.2)

Showing variability: Boxplots

ggplot(weather_day, aes(as.factor(month), TempAve)) +
  geom_boxplot()

Showing variability: Violin plots

ggplot(weather_day, aes(as.factor(month), TempAve)) +
  geom_violin()

Showing variability: Combinations

ggplot(weather_day, aes(as.factor(month), TempAve)) +
  geom_violin() +
  geom_jitter(width=0.1, alpha=0.25) + 
  geom_smooth(aes(x=month))

Refining figures

Example

ggplot(weather_day, aes(month, TempAve)) +
  geom_smooth(color="black") +
  geom_jitter(width=0.1, alpha=0.5, aes(color=TempAve)) +
  theme_bw(base_size=12) + 
  scale_color_gradient(low="navy", high="red", guide=FALSE) +
  labs(x="Month", y="Temperature", 
       title="Edinburgh temperatures in 2014")

Data management with tidyverse

Raw data are rarely in a convenient format for analysis.

Need to be pre-processed to be ready for analysis.

Approx. 80% of analysis time is spent on getting data ready for analysis.

Basics

Data Transformation: dplyr

Data Transformation: tidyr

Tidy data

“Happy families are all alike; every unhappy family is unhappy in its own way.” - Leo Tolstoy

“Tidy datasets are all alike, but every messy dataset is messy in its own way.” - Hadley Wickham

Data Transformation: tidyr

Tidy data

Pipes

Easy way to do a series of operations:

g(f(x, y), z) is the same as x %>% f(y) %>% g(z)

But easier to read and write: “Take x then do f(y) then do g(z)

Pipes

(source: https://twitter.com/dmi3k/status/1191824875842879489)

Example: Public Health England data on mental health

A few interesting and “robust” indicators:

library(tidyverse)
## -- Attaching packages ----------------------------------------- tidyverse 1.3.0 --
## <U+2713> tibble  2.1.3     <U+2713> dplyr   0.8.3
## <U+2713> tidyr   1.0.0     <U+2713> stringr 1.4.0
## <U+2713> readr   1.3.1     <U+2713> forcats 0.4.0
## <U+2713> purrr   0.3.3
## -- Conflicts -------------------------------------------- tidyverse_conflicts() --
## x dplyr::filter() masks stats::filter()
## x dplyr::lag()    masks stats::lag()
library(fingertipsR)
#Get data from the PHE server
mh_phe <- fingertips_data(IndicatorID = c(848, 90646, 90275, 41001),
                          AreaTypeID = 102) %>%
  select(IndicatorID, IndicatorName, 
         County = AreaName, Region = ParentName, 
         Timeperiod, Value) %>%
  mutate(Year = as.numeric(str_sub(Timeperiod, 1, 4))) %>%
  filter(Region != "England") %>%
  mutate(Region = str_remove_all(Region, " region"))

Example: Public Health England data on mental health

load("../data/PHE_MentalHealth.Rdata")
summary(mh_phe)
##   IndicatorID    IndicatorName         County             Region         
##  Min.   :  848   Length:9422        Length:9422        Length:9422       
##  1st Qu.:41001   Class :character   Class :character   Class :character  
##  Median :41001   Mode  :character   Mode  :character   Mode  :character  
##  Mean   :44945                                                           
##  3rd Qu.:41001                                                           
##  Max.   :90646                                                           
##                                                                          
##   Timeperiod            Value              Year     
##  Length:9422        Min.   : 0.0532   Min.   :2001  
##  Class :character   1st Qu.: 5.6154   1st Qu.:2006  
##  Mode  :character   Median : 9.5482   Median :2011  
##                     Mean   :12.7162   Mean   :2010  
##                     3rd Qu.:14.2544   3rd Qu.:2014  
##                     Max.   :71.1897   Max.   :2017  
##                     NA's   :619
str(mh_phe)
## 'data.frame':    9422 obs. of  7 variables:
##  $ IndicatorID  : int  848 848 848 848 848 848 848 848 848 848 ...
##  $ IndicatorName: chr  "Depression: Recorded prevalence (aged 18+)" "Depression: Recorded prevalence (aged 18+)" "Depression: Recorded prevalence (aged 18+)" "Depression: Recorded prevalence (aged 18+)" ...
##  $ County       : chr  "Hartlepool" "Middlesbrough" "Redcar and Cleveland" "Stockton-on-Tees" ...
##  $ Region       : chr  "North East" "North East" "North East" "North East" ...
##  $ Timeperiod   : chr  "2013/14" "2013/14" "2013/14" "2013/14" ...
##  $ Value        : num  5.32 5.74 7.86 9.19 7.38 ...
##  $ Year         : num  2013 2013 2013 2013 2013 ...

Table: Health Indicators by Region

mh_phe %>% group_by(IndicatorName, Region) %>%
  summarize(Value = mean(Value, na.rm = TRUE)) %>%
  pivot_wider(names_from = IndicatorName, values_from = Value)
## # A tibble: 9 x 5
##   Region  `Depression: QOF … `Depression: Rec… `Percentage of ph… `Suicide rate`
##   <chr>                <dbl>             <dbl>              <dbl>          <dbl>
## 1 East M…               1.49              8.68               56.9           9.90
## 2 East o…               1.35              8.03               56.6           9.96
## 3 London                1.00              6.01               57.3           9.61
## 4 North …               1.47              9.01               52.9          12.4 
## 5 North …               1.65              9.69               53.1          11.6 
## 6 South …               1.27              8.05               58.7          10.6 
## 7 South …               1.33              8.30               58.6          10.9 
## 8 West M…               1.44              8.73               54.1          10.1 
## 9 Yorksh…               1.43              8.80               55.2          10.7

Table: Health Indicators by Year

mh_phe %>% group_by(IndicatorName, Year) %>%
  summarize(Value = mean(Value, na.rm = TRUE)) %>%
  pivot_wider(names_from = IndicatorName, values_from = Value) %>%
  arrange(Year)
## # A tibble: 17 x 5
##     Year `Depression: QOF … `Depression: Reco… `Percentage of ph… `Suicide rate`
##    <dbl>              <dbl>              <dbl>              <dbl>          <dbl>
##  1  2001              NA                 NA                  NA             11.2
##  2  2002              NA                 NA                  NA             11.0
##  3  2003              NA                 NA                  NA             10.9
##  4  2004              NA                 NA                  NA             10.8
##  5  2005              NA                 NA                  NA             10.4
##  6  2006              NA                 NA                  NA             10.3
##  7  2007              NA                 NA                  NA             10.4
##  8  2008              NA                 NA                  NA             10.3
##  9  2009              NA                 NA                  NA             10.2
## 10  2010              NA                 NA                  NA             10.3
## 11  2011              NA                 NA                  NA             10.6
## 12  2012              NA                 NA                  55.7           10.7
## 13  2013               1.04               6.44               55.5           10.8
## 14  2014               1.18               7.24               56.5           10.6
## 15  2015               1.43               8.17               56.5           10.4
## 16  2016               1.51               9.04               NA             10.3
## 17  2017               1.55               9.78               NA             NA

Reproducible Research

Workflow

  1. Collect data
  2. Read in raw data: read.table, haven::read_sav, readxl::read_excel, load(url(...)), etc.
  3. Check the data, make exploratory plots: ggplot
  4. Pre-process the raw data to get it ready for analysis: filter, group_by, mutate, gather, spread, etc.
  5. Check the data, make exploratory plots: ggplot
  6. Analyze the data: t.test, lm, lmer, sem, etc.
  7. Make final publication-quality figures: ggplot
  8. Write the report

Use R Markdown for steps 2-7 to produce a fully reproducible analysis report.

(Can also do step 8 and generate a final PDF or Word document of the full report)

R Markdown

R Markdown: Getting Started

Prerequisites

Create a markdown file

R Markdown: Template

R Markdown: Example

Excerpt from the R Markdown file for this lecture:

Key points